Curation Micro-Services: A Pipeline Metaphor for Repositories
Authors: Abrams, Cruse, Kunze, and Minor
Abstract
The effective long-term curation of digital content requires expert analysis, policy setting, and decision making, together with a robust technical infrastructure that can effect and enforce curation policies and implement appropriate curation activities. Since the number, size, and diversity of content under curation management will undoubtedly continue to grow over time, and the state of curation understanding and best practices relative to that content will undergo a similar constant evolution, one of the overarching design goals of a sustainable curation infrastructure is flexibility. In order to provide the necessary flexibility of deployment and configuration in the face of potentially disruptive changes in technology, institutional mission, and user expectation, a useful design metaphor is provided by the Unix pipeline, in which complex behavior is an emergent property of the coordinated action of a number of simple independent components. The decomposition of repository function into a highly granular and orthogonal set of independent but interoperable micro-services is consistent with the principles of prudent engineering practice. Since each micro-service is small and self-contained, the services are individually more robust and collectively easier to implement and maintain. Because the micro-services are freely interoperable in various strategic combinations, any number of micro-services-based repositories can be easily constructed to meet specific administrative or technical needs. Importantly, since these repositories are purposefully built from policy-neutral and protocol- and platform-independent components to provide the function minimally necessary for a specific context, they are not constrained to conform to an infrastructural monoculture of prepackaged repository solutions.
The University of California Curation Center has developed an open source micro-services infrastructure that is being used to manage the diverse digital collections of the ten-campus University system and a number of non-university content partners. This paper provides a review of the conceptual design and technical implementation of this micro-services environment, a case study of initial deployment, and a look at ongoing micro-services developments.
Introduction
Information technology and resources have become integral and indispensable to the pedagogic mission of the University of California, with members of the UC community routinely producing and utilizing a wide variety of digital assets in their teaching, learning, and research activities. These assets represent the intellectual capital of the University; they have inherent enduring value and need to be managed carefully to ensure that they will remain available for use by future scholars. Within the UC system the newly established UC Curation Center (UC3), one of five programmatic units of the California Digital Library, has a broad mandate to provide innovative solutions that ensure the long-term usability of the University’s digital assets. While curation is not solely a technical undertaking (curation success is, for example, highly dependent on important human competencies, analysis, and decision making), a robust infrastructure in which to manage valuable digital content efficiently and effectively is nevertheless a necessary foundation.
Paper Proposal for OR 2010 Abrams, Cruse, Kunze, and Minor Curation Micro-services: A Pipeline Metaphor for Repositories Page 2 of 4
As a central system-wide service provider to the ten UC campuses, UC3 is routinely asked to assume custodial stewardship for digital content in ever-increasing number, size, and diversity of type. Furthermore, this content is often used and repurposed in novel contexts far removed from the intention of its original creators.
Thus, the programmatic imperative of UC3 is to provide a curation environment that is comprehensive in scope, yet flexible with regard to local policies and practices and to the inevitability of disruptive changes in technology and user expectation. To meet these goals the UC3 infrastructure is based on the idea of micro-services: the decomposition of repository function into a highly granular and orthogonal set of independent but interoperable components that can be freely composed in strategic combinations towards useful ends. The paradigmatic metaphor for the micro-services approach is the Unix pipeline.
The Pipeline Metaphor
The pipeline concept was first proposed by Douglas McIlroy in 1964 and gained wide visibility through its integration in the Unix operating system in 1973 (Ritchie, 1980). A pipeline chains together a set of independent processes such that the output of each process becomes the input to the next. Although the local function of individual components can be extremely narrowly scoped, sophisticated global behavior is nevertheless an emergent property of their coordinated action. Because the processes are coupled at the I/O level, pipelines are highly dependent on the stability of the public interface “contracts” exposed by the component processes.
The flexibility inherent to a pipeline serves a number of important purposes. By decomposing complex function into a set of simple constituent parts, the development and maintenance of those parts is simplified. This approach is consistent with prudent engineering practice as articulated in forms ranging from the philosophical statement of Occam’s Razor (“entia non sunt multiplicanda praeter necessitatem [entities must not be multiplied beyond necessity]”; Wikipedia 2010a) to the popular culture adage of Murphy’s Law (“whatever can go wrong, will go wrong”; Wikipedia 2010b).
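The chaining behavior described above can be illustrated with a minimal Python sketch. It is not UC3 code: the stage names (strip_blank, lowercase, unique) and the pipeline helper are hypothetical, chosen only to show how narrowly scoped components that share one stable interface (an iterable of strings in, an iterator of strings out) compose into more sophisticated behavior.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# The shared "contract": every stage consumes and produces a stream of strings.
Stage = Callable[[Iterable[str]], Iterator[str]]


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    """Trim surrounding whitespace and drop lines that are empty afterwards."""
    for line in lines:
        trimmed = line.strip()
        if trimmed:
            yield trimmed


def lowercase(lines: Iterable[str]) -> Iterator[str]:
    """Fold every line to lower case."""
    for line in lines:
        yield line.lower()


def unique(lines: Iterable[str]) -> Iterator[str]:
    """Pass each distinct line through once, preserving first-seen order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


def pipeline(*stages: Stage) -> Stage:
    """Chain stages so the output of each becomes the input of the next."""
    def run(source: Iterable[str]) -> Iterator[str]:
        return reduce(lambda stream, stage: stage(stream), stages, source)
    return run


# Hypothetical sample data: file names arriving from some upstream process.
normalize = pipeline(strip_blank, lowercase, unique)
result = list(normalize(["  Report.PDF ", "", "report.pdf", "Data.CSV"]))
```

Because each stage depends only on the stream interface, stages can be reordered, removed, or replaced without touching the others, which is the property the pipeline metaphor is meant to capture.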
The design principles underlying the pipeline metaphor have been generalized by UC3 into a preference for the small and simple over the large and complex, the minimally sufficient over the feature laden, the fully configurable over the prescribed, and the proven over the (merely) novel. The advantages of the micro-services approach to curation infrastructure are manifold. Since each micro-service is small and self-contained, the services are individually more robust and collectively easier to implement and maintain. Since the level of resource investment in any given service is small, the level of institutional commitment to that service is concomitantly small, so services are easier to deprecate and replace when they have outlived their usefulness. This is an important consideration given that curation over archival time-spans is best seen as a relay requiring periodic handoffs within a constantly evolving ecosystem of services and service providers (Janée et al. 2008). Since the micro-services are inherently amenable to flexible and strategic recombination, many purpose-built repositories can be easily constructed with the minimally necessary function for a specific administrative or technical purpose.
Design and Implementation
The initial repertoire of micro-services coalesces into four hierarchical levels (see Figure 1). The range of underlying function moves from preservation necessity towards curation sufficiency by maintaining the integrity of content state, managing content context, providing user-facing services, and enabling the enhancement of value.
Curation
  Value
    Annotation of content by consumers
    Notification of new content availability
  Service
    Transformation to create derivatives
    Search of content and metadata
    Index to enable fast search
    Ingest of content for curation
Preservation
  Context
    Characterization to extract content properties
    Inventory of curated content
  State
    Replication for safety
    Fixity to verify bit-level integrity
    Storage for long-term retention
    Identity for long-term reference
Figure 1 – Curation micro-services
The general principles of granularity and orthogonality are applied throughout the architecture, with each micro-service itself built up from smaller components. For example, the Storage service is modeled in terms of five conceptual entities: service, node, object, version, and file. The service itself acts as a broker to an arbitrary number of storage nodes, each of which manages a storage sub-domain established to meet specific policy, administrative, or technical needs. Nodes manage digital objects, which can encapsulate an arbitrary number of versions, each of which is a set of files representing a discrete state of the object. (As a corollary, any change introduced to object state instantiates a new object version. Previous states are stored as a sequence of reverse deltas to minimize storage utilization yet support the easy re-instantiation of an arbitrary version.) Subsidiary systems and specifications for these entities include Content Access Node (CAN), Pairtree, Dflat, Checkm, and Reverse Directory Deltas (ReDD). (More information is available at .)
All conceptual entities are defined in terms of a set of state properties and behaviors that can manipulate that state. Entity state information follows the Linked Data paradigm in including actionable links to related entities, when relevant (Bizer et al. 2007).
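The entity hierarchy and reverse-delta versioning described above can be sketched in Python as follows. This is an illustrative simplification under stated assumptions, not the CAN, Dflat, or ReDD formats: all class and function names are hypothetical, files are held in memory as name-to-bytes mappings, and a reverse delta simply records, for each file that changed, its content in the older version (or None if the file did not yet exist).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Files = Dict[str, bytes]                    # one version: file name -> content
ReverseDelta = Dict[str, Optional[bytes]]   # changed name -> older content (None = absent)


def reverse_delta(older: Files, newer: Files) -> ReverseDelta:
    """Record what must change in `newer` to get back to `older`."""
    return {name: older.get(name)
            for name in older.keys() | newer.keys()
            if older.get(name) != newer.get(name)}


def apply_delta(state: Files, delta: ReverseDelta) -> Files:
    """Step one version backwards by applying a reverse delta."""
    restored = dict(state)
    for name, content in delta.items():
        if content is None:
            restored.pop(name, None)   # file did not exist in the older version
        else:
            restored[name] = content
    return restored


@dataclass
class DigitalObject:
    object_id: str
    current: Optional[Files] = None    # only the latest version is stored in full
    deltas: List[ReverseDelta] = field(default_factory=list)

    @property
    def version_count(self) -> int:
        return 0 if self.current is None else len(self.deltas) + 1

    def add_version(self, files: Files) -> int:
        """Any change to object state instantiates a new version."""
        if self.current is not None:
            self.deltas.append(reverse_delta(self.current, files))
        self.current = dict(files)
        return self.version_count

    def get_version(self, number: int) -> Files:
        """Re-instantiate version `number` (1-based) from the latest state."""
        if not 1 <= number <= self.version_count:
            raise KeyError(f"no version {number} for {self.object_id}")
        state = dict(self.current)
        for delta in reversed(self.deltas[number - 1:]):
            state = apply_delta(state, delta)
        return state


@dataclass
class StorageNode:
    node_id: str
    objects: Dict[str, DigitalObject] = field(default_factory=dict)


@dataclass
class StorageService:
    """Broker over an arbitrary number of storage nodes."""
    nodes: Dict[str, StorageNode] = field(default_factory=dict)

    def add_version(self, node_id: str, object_id: str, files: Files) -> int:
        node = self.nodes.setdefault(node_id, StorageNode(node_id))
        obj = node.objects.setdefault(object_id, DigitalObject(object_id))
        return obj.add_version(files)

    def get_version(self, node_id: str, object_id: str, number: int) -> Files:
        return self.nodes[node_id].objects[object_id].get_version(number)


# Hypothetical identifiers and content, for illustration only.
service = StorageService()
service.add_version("node-1", "obj-1", {"a.txt": b"one"})
service.add_version("node-1", "obj-1", {"a.txt": b"two", "b.txt": b"new"})
latest = service.add_version("node-1", "obj-1", {"b.txt": b"new"})
first = service.get_version("node-1", "obj-1", 1)
```

Storing only the latest state in full keeps the most commonly requested version cheap to retrieve, while older versions cost one delta application per step back.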
For example, a version contains a back link to its object and forward links to all of its files. State properties are defined as semantic ontologies and can be reported in various expressions including ANVL (mail header-like name/value pairs), JSON, RDF/Turtle, RDF/XML, XHTML, and XML. Behaviors are first defined as abstract methods that are then mapped to specific interactive modalities. In general, service methods can be invoked through a RESTful API, a command line API, or a procedural interface with various language bindings (currently, either Java or Perl).
The combinatoric power of the micro-services approach is illustrated by the ingest workflow, which coordinates the actions of four components: Ingest, Identity, Storage (with subsidiary invocation of encapsulated storage nodes), and Inventory (see Figure 2). The Inventory service manages a triple-store-based metadata catalog for all managed content. This catalog is intended as an optimization to support administrative and technical queries, and in general duplicates a subset of the authoritative metadata that is expressed in files managed by the Storage service. Thus the Inventory catalog can always be fully reinstantiated, if necessary, from the metadata-of-record in the Storage service.
Conclusion
In order to facilitate the application of UC Curation Center service offerings to new campus constituencies, and to respond to the increasing number, size, and type diversity of digital content, the underlying curation infrastructure must be easily adaptable to local needs and practices. An architectural approach based on the principles underlying the pipeline metaphor, in which curation function is embodied in a set of granular and orthogonal micro-services, best provides the necessary deployment flexibility, while also simplifying development and maintenance effort.
Service interoperability is facilitated by strict conformance to the behavioral semantics of well-defined public interfaces. This permits comprehensive curation function to emerge from the strategic combination of individual atomistic services.
Journal title:
- J. Digit. Inf.
Volume 12, Issue
Pages -
Publication date: 2011